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Methods for Classifying High-Dimensional Biological Data 



CROSS-REFERENCE TO RELATED APPLICATIONS 

This application claims the benefit of U.S. Provisional Patent Application No. 
5 60/233,546, filed September 19, 2000, the contents of which are hereby uicorporatedby 
reference for all purposes. 

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH 

The U.S. Government has certain rights to the invention pursuant to contracts ACI 
96-19020 and DMS 98-70172 awarded by the National Science Foundation and contract 
10 P43 ES04699 awarded by the National Institute of Environmental Health Sciences, 
National Institutes of Health. 

REFERENCE TO A MICROFICHE APPENDIX 

Not applicable. 

BACKGROUND OF THE INVENTION 

1 5 The invention pertains to the field of biostatistics, and more particularly to 

melhods of classifying high dimensional biological data. 

With the wealth of gene expression data firom microairays (such as high density 
oligonucleotide arrays and cDNA arrays) prediction, classification, and clustering 
techniques are used for analysis and interpretation of the data. Developments in the field 

20 of proteomics are expected to generate vast amounts of protein expression data by 

quantitating the amounts of a large number of different ijroteins within a cell or tissue. 
One can easily imagine carrying out experiments to generate large volxmies of data that 
correlate, e.g. , the. expression patterns of proteins, mRNAs, cellular complements, of 
membrane lipids, or other metabolic factors to a biologic response (e.g., sensitivity of a 

25 cell to a drug), to one of two biologic state (e.g., normal or disease states), or to one of a 
number of biologic states (e.g., one of a number of different tumor types.) One challenge 
of dealing with ttie large numbers of variables sampled using microairay technologies is 
developing methods to extract meaningful information firom fiie data that can be used to 
predict or classify the biological state or response of a sample. Such methods would 
30 dramatically improve our ability to apply genomics or proteomics data to improve 
medical diagnoses and treatments. 
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The use of global gene expression data from microarrays for human cancer 
research is relatively new (DeRisi et aL, 1996). However, since the introduction of DNA 
microairay technology to quantitate thousands of gene expressions simultaneously 
(Schena et al, 1995; Lockhart et al, 1996), there have been increasing activities in the 
5 area of cancer classification or discrimination. For example, Golub et aL (1999) used a 
weighted voting scheme for the molecular classification of acute leukemia based on gene 
expression monitoring from Affymetrix high-density oligonucleotide arrays. Also using 
Affymetrix oligonucleotide arrays Alon et al (1999) used a cluster technique based on 
the deterministic-annealing algorithm to classify cancer and normal colon tissues. Scherf 
10 etaL (2000) and Ross et al (2000) used classical clustering techniques such as average- 
linkage to cluster tumor tissues fix^m various sites of origin: colon, renal, ovarian, breast, 
prostate, lung, central nervous system as well as leiikemias and melanomas. The popular 
method of support vector machines ("SVM*0 introduced by Vapnik was applied to the 
classification of tumor and normal ovarian tissues by Furey et al (2000). The use of gene 
15 expression profiles to distinguish between negative and positive for BRCAl mutation (as 
well as negative and positive for BRCA2 mutation) in hereditary breast cancer was 
described by Hedenfrdk et aL (200 1). Some other important applications in human cancer 
include classifying diffuse large B-cell lymphoma CT>LBCL") (Alizadeh et a/., 2000), 
mammary epithelial cells and breast cancer (Perou et ahy 1999,.2000) and skin cancer 
20 melanoma (Bittner et al, 2000) based on gene expression data. Dudoit et al. (2000) and 
Ben-Dor et aL (2000) presented a comparative studies of classification methods applied 
to various cancer gene expression data. These techniques have also helped to identify 
previously undetected subtypes of cancer (Gohib et aL, 1999; Alizadeh et aL, 2000; 
Bittner et al,, 2000; Perou et aL, 2000). The problem of deriving useful "predictions" 
25 from high dimensional data may come in various forms of applications as well, such as, 
eg., using expression array data to predict patient survival duration with germinal center 
B-like DLBCL as compared to compared to those with activated B-like DLBCL using 
Kaplan-Meier survival curves (Ross et aL, 2000). 

Gene expression data &om DNA naicroarrays is characterized by many measured 
30 variables (genes) on only a few observations (experiments), although both the number of 
experiments and genes per experiment are growing rapidly. The number of genes on a 
single array usually is in the thousands, so the number of variables p easily exceeds the 
number of observations N. Although, the number of measured genes is large there may 
only be a few underlying gene components that accomt for much of the data variation; 
3 5 for instance, only a few linear combinations of a subset of genes may account for nearly 
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all of the response variation. Unfortunately, it is exceedingly diffictilt to determine which 
genes are mennibers of the subset given the large number of genes, and tJie small 
number of observations, N. The feet that experiments such as, microarray 
e3q)eTiments that are characterized by many measured variables genes), on only a 
5 relatively few observations or samples, JV, renders all statistical methods requiring 'N>p 
to be of no direct use. 

While this problem has been described with reference to gene expression data 
from DNA microarrays, similar issues arise with any type of biological data in which the 
number of variables measured exceeds the niunber of observations, and the methods of 

10 the invention are applicable to many different types of biological data. Thus, there is a 
need in the art for methods of dealing with such ^liigh dimensionar' data (z.e., data tiiat 
are statistically underdetennined because there are fewer observations, i^, than the 
number of variables,/?) to allow classification of biological samples. Methods are needed 
for binary classification to discriminate between two classes such as normal and 

1 S cancer samples, and between two types of cancers) based on high dimensional data 

obtained from the sample, Metiiiods also are needed for classification or discrimination of 
more than two groups or classes ("multi-class")- TTie need for multi-class discrimination 
methodologies is apparent in many microarray experiments where various cancer types 
are simultaneously considered. The present invention addresses these and other 

20 shortcomings in the art by providing statistical methods of analyzing biological data to 
permit accurate classification of samples. The invention uses the method of partial least 
squares ("PLS") (for binary classification) or the method of multivariate partial least 
squares C'MPLS") (for midti-class classification) as a dimension reduction technique, 
followed by a classification step. 

25 BRIEF SUMMARY OF THE INVENTION 

It is an object of the invention to provide metiiods for classifying biological 
samples from which higji dimensional data has been obtained. The methods of the 
inventionpermit binary classification (i.e., assignment of a sample to one of two classes), 
as well as multi-class classification (f.e., assignment of a sample to one of more than two 
30 classes). The method involves analyzing data obtained from biological samples with 

known classifications, carrying out a dimension reduction step on the data, and using the 
reduced data as input data in a classification step to generate a model useful for predicting 
the classification of a biological sample with an unknown classification. In one 
embodunent of the invention, the classification model is bmary, z.e., it accounts for only 
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two classes. In this embodiment, the method preferably is carried out using PLS 
dimension reduction. Classification is then preferably carried out using logistic 
discrimination CTD"). In another preferred embodiment of the mvention, classification 
is carried out using quadratic discriminant analysis ("QDA")- In another embo<Ument, the 

5 classification model permits assignment of the vmknown sample to one of more than two 
classes. In this multi-class embodiment of the invention, dimension reduction is 
preferably carried out using multivariate partial least squares ('"MPLS") dimension 
reduction, and classification is achieved with polychotomous discrimination ("PD") or 
QDA. In yet another preferred embodiment, a subset of the p variables may be selected 

10 according to standard t-statistics, or pairwise comparison and the analysis of variance 
(ANOVA) prior to the dimension reduction step. The methods of the present invention 
also peraiit assessment of Ihe confidence associated with any specific prediction by 
examining the estimated conditional class probability, tc of a sample. 

BRIEF DESCRIPTION OF THE DRAWINGS 

15 Fig. 1: Ulustrationof dimension reduction for NC160 data. 

Fig. 2: Polychotomous discrimination using multivariate partial least squares 
CTMPLS") components and principal components ("PCs**). 

Fig. 3: Quadratic discriminant analysis C'QDA") with leave-out-one cross 
validation ("CV") using MPLS components and PCs. 
20 Fig. 4: QDA with direct-resubstitution using MPLS components and PCs. 

Fig. 5: MPLS components in PD, QDA-direct resubstitution and QDA-CV. 

DETAILED DESCRIPTION OF THE PREFERRED EMBODBMENTS 

In situations where the nunober of observations, Ny is less than the number of 
variables,/?, when JV< p), dimension reduction is needed to reduce the higjip- 
25 dimensional variable space to a lower isT-dimensional component space. Under similar 
data structure in the field of chemometrics, the method of partial least squares CTLS") 
has been found to be a useful dimension reduction technique. PLS has been useful as a 
predictive modeling regression method in the field of chemometrics. For example, in 
spectroscopy one may be predicting chemical composition of a conopound based on 
30 observed signals for a particular wavelength, where the number of wavelengths 

(variables) is large. (Applications of PLS are abundant in the Journal of Chemometrics 
(John Wiley) and Chemometrics and Intelligent LaboraioTy Systems (Elsevier), for 
example.) An introduction to PLS regression is given by Geladi and Kowalski (1986). 
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The use of PLS in calibration can be found in Martens andNaes (1989). Some theoretical 
aspects and data-analytical properties of PLS have been studied by chemometridans and 
statisticians (de Jong, 1993; Frank and Friedman, 1996; Helland, 1988; Helland and 
Almoy, 1994; Hoskuldsson, 1988; Lorber, Wangen andKowalski, 1997; Phatak, Reilly, 
andPenlidis, 1992; Stone and Brooks, 1990; Garwaithe, 1994). 

Iei one aspect, the present invention provides analysis procedures for binary 
classification ("prediction") of biological samples such as human tumor samples based on 
high dimensional data such as is obtained from microarray gene expressions 
measurements- Here, the response variable is a binary vector indicating, e.g., normal or 
tumor samples, for example. This procedure involves dimension reduction using PLS and 
classification using methods such as logistic discrimination ("LD'*)» quadratic 
discriminant analysis ("QDA"), or linear discriminant analysis. That is, the procedure 
involves two steps, a dimension reduction step followed by a classification step. 
According to the methods of the present invention, flie PLS coinponents may be modified 
prior to their use in classification methods. Such modifications include, e.g., singular 
value decomposition, or linear combinations of univariate logistic regression. The 
methods may optionally make use of a preliminary screening step to select informative 
variables prior to dimension reduction by PLS. For binary classifications, t-statistics 
provides a convenient method for screening. These methods of the invention are 
illxistrated by application to five different microarray data sets involving various human 
tumor san5)les: (1) normal versus ovarian tumor samples, (2) acute myeloid leukemia 
C*AML") versus acute lymphoblastic leukemia C*ALL"), (3) diffuse large B-cell 
lymphoma ("DLBCL") versus B-cell chronic lymphocytic leukemia C'BCLL'^), (4) 
nomial versus colon tumor samples and (5) npn-small-celHung-carcinoma C*NSCLC") 
versus renal. To assess the stability of the prediction results and methods we used re- 
randomization studies (as described in the Methods section, below). 

In another aspect, the present invention provides analysis procedures fiDr multi- 
class classification ("prediction") of biological samples such as human tumor samples 
based on high dimensional data such as is obtained from microarray gene expression 
measurements. Here the response variable is a discrete vector indicating, e.g., a particular 
type of cancer. This aspect of the invention involves multivariate partial least squares 
CTvDPLS'O dunension reduction together with a classification step such as polychotomous 
(Uscrimination ("PD"), quadratic discrimmant analysis ("QDA")» or linear discruninant 
analysis. According to the methods of the present invention, the MPLS components may 
be modified prior to their use in classification methods. Such modifications include, e.g.. 
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singular value decomposition, or linear combinations of univariate logistic regression. 
Preliminary screening to select infonnation variables prior to dimension reduction by 
MPLS may be carried out based on pairwise comparison and the analysis of variance 
(ANOVA) of variables,/?. The multi-class classification methods were applied to four 
cancer gene expression data sets. Specifically^ the methodologies proposed in this paper 
were applied to four gene expression data sets wilii multiple classes: (a) a hereditary 
breast cancer data set with (1) BRCAUmutation, (2) BRCA2-mutation and (3) sporadic 
breast cancer samples, (b) an acute leukemia data set with (1) acute myeloid leukemia 
C'AML"), (2) T^ell acute lymphoblastic leukemia CT-ALL") and (3) B-cell acute 
lymphoblastic leukemia C*B-ALL") samples, (c) a lymphoma data set with (1) dif&ise 
large B-cell lymphoma CTDLBCL"), (2) B-cell chronic lymphocytic leukemia ^BCLL") 
and (3) follicular lymphoma ("FL") samples, and (d) the NC160 data set with (1) 
leukemia, (2) colon, (3) melanoma, (4) renal, and (5) central nervous system ("OMS") 
samples. The midti-class methods of the invention were fijrflier tested using data 
generated fix>m a simulation gene expression model. The simulation model, procedures, 
and results are described in the Simulation Studies section. Most technical details are 
deferred to the Appendix. 

Advantages of the present invention over another well-known dimension 
reduction method, principal components analysis ("PCA") (Massey, 1965; JollifEe, 1986) 
were established by comparing results obtained using PLS or MPLS dimension reduction 
with those firom PCA. PCA is used to reduce the high dunensional data to only a few 
components which explain as much of the observed total variation (such as, e.g., gene 
expression variation) as possible. This is achieved without regards to the response 
variation. Components constructed this way are called principal components CTCs")- In 
contrast to PCA, PLS components are chosen so that the sample covariance between ttie 
response and a linear combination of tiie p predictors (e.g., genes) is maximum. The 
latter criterion for PLS is more sensible since there is no a priori reason why constructed 
components having large predictor variation (e.g^., gene expression variation) should be 
strongjy related to the response variable. Certainly a component with small predictor 
variance could be a better predictor of the response classes. The ability of the dimension 
reduction metihiod to summarize the covariation between predictors such as g^e 
expressions and response classes should, in principle, yield better prediction results. 
Thus, for PCA to be competitive, relative to PLS, one can pre-select the predictors which 
are predictive of the response classes and then apply PCA Otherwise, one might expect 
PLS to give better predictions. Using the leukemia data set of Golub et al (1999) we 
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illustrate a condition that demonstrates the superiority of the PLS dimension reduction 
approach used by the present invention, relative to PCA in predicting response class. 

The organization of this specification is as foUov/s, In the Methods section we 
describe the dimension reduction methods of PCA, MPLS and PLS, the classification 
5 methods of LD, QDA, and PD and predictor selection strategies based on simple t- 
statistics, and pairwise comparison and the analysis of variance (ANOVA). In the 
Methods section we also describe the re-randomization technique used to finther assess 
the prediction metiiods and results. The results fix)m applying the proposed methods to 
microairay data sets are given in the Results section. Also included in the Results section 
10 is the illustration of a condition when PCA fails to predict well relative to PLS . We then. 
conclude and discuss generalizations and other applications of the methods of the 
invention to microarray gene expression and other high dimensional data. 

1. METHODS 

Traditional statistical methodologies for classification (prediction) do not work 
1 5 when there are more variables, p, than there are samples, N. Specifically, for gene 

expression data, the munber of tissue samples is much smaller than the number of genes. 
Thus, methods able to cope with tihie high dimensionality of the data are needed. The 
present invention relies on a novel combination of dimension reduction with traditional 
classification methods, such as logistic discrimination, quadratic discriminant analysis, 
20 and polychotomous discrimination C*PD") for high dimensional gene expression data. 

While the invention is illustrated with respect to gene expression data, the methods of the 
present invention are applicable to any high dimensional biologic data. 

Id. Binary Classification 

PLS is the primary dimension reduction method utilized for binary classification, 
25 although we also consider the related method of PCA for comparison. The approach 
taken here consists of two main steps. The first step is the dimension reduction step, 
which reduces the high dimension^ down to a lower dimension K(K<N). Since the 
reduced dimension, is smaller than the number of samples, in the second step, one 
can apply readily available prediction tools, such as logistic discrimination ('LD") or 
30 quadratic discrimmant analysis ("QDA"). 

We introduce the method of PLS first by briefly describing the well known and 
related method of PCA. Classification methods, namely LD and QDA, are also briefly 
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described. Prior to analysis, gene selection may be usefiil. Hence, we also describe a 
simple gene selection strategy based on t-statistics. 

1.1.1. Dimension Reduction: PCA andPLS 

The goal of dimension reduction methods is to reduce the high/M3imensional 
5 predictor (gene) space to a lower ^-dimensional (component) space. This is achieved by 
extracting or constmctingjS: components in the predictor space to optimize a defined 
objective criterion. PCA and PLS are two such methods. To describe these methods 
some notations are required. LetX be 2a3.Nxp matrix ofW samples andp predictors. 
Also, let y denote the X 1 vector of response values, such as an mdicator of leukemia 
10 classes or normal versus tumor tissues. 

In PCA the goal is to exbact gene conqwneats sequentially which maxunize the 
total predictor (e.g., gene expression) variability, irrespective of how well the constructed 
conq)onents predict cancer classes. Thwe is no a priori reason why con5>onents with hi^ 
total predictor variability {e.g., gene expression) should predict cancer classes weU. In 
15 PCA, orthogonal linear combinations are constructed to maximize tiie variance of the 
Imear combination of the predictor variables sequentially, 

(1) 

subject to the orthogonality constraint 

v'Sv,= 0, for an. l<i<fe^ 
20 J « 

(2) • 
where S = X'X. The maximum number of nonzero components is the rank of X, which is 
the same as the rank of X'X. Often in applications of PCA, the predictors are 
standardized to have mean zero and standard deviation of one. This is referred to as PCA 
25 of the correlation matrix, = (l/(i^ - 1))(X - lxO'(3C- JxO -n,e constructed principal 
components (PCs), satisfyiag the objective criterion (1) are obtained fiom tiie spectral 
decomposition of R, 

(3) 

30 where ^ ~ (vi, - • . , vn_i) are tiie corresponding eigenvectors. The jth PC is a linear 
combination of &e original predictors, Xv,. Roughly, the constiiicted components 
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summarize as much of the original p predictors' information (variation), irrespective of 
the response class information. 

Note that maximizing the variance of the linear combination of the predictors 
(e.g., genes), namely var(Xv), may not necessarily yield components predictive of the 
response variable (such as leukemia classes) because the extracted PCs do not depend on 
the response vector y, indicating, the class to which the sample belongs For this reason, a 
different objective criterion for dimension reduction may be more appropriate for 
prediction. 

In contrast to PCA, PLS (orthogonal) components are constructed to maximize 
the sample covariance between the response values (y) and the linear combination of the 
predictor (e.g., gene expression) values (X). The objective criterion for constructing 
components in PLS is to sequentially maximize the covariance between the response 
variable and a linear combination of the predictors. That is, in PLS, the components are 
constructed to maximize the objective criterion based on the sample covariance between y 
and Xw. Thus, we find the weight vector w satisfying the following objective crit^on, 

wjb == aa-gmaaccqy^(Xw^ 

(4) 

subject to the orthogonality constraint 

w'Swj* = 0 for all 1 < j < * 

(5) 

where S = X'X. The maximum number of components, as before, is the rank of X. The 
jth PLS components are also a linear combinations of the original predictors, Xwi. A 
basic algorithm to obtain w is given in the Appendix. 

Based on the different objective criterion of PCA and PLS, namely varCXv) and 
cov(Xw, y), it is reasonable to suspect tiiat if the original predictors (e.g^., genes) are 
already predictive of response classes then the constructed components from PCA would 
likely be good predictors of response classes. Therefore, prediction results should be 
similar to that based on PLS components. Otherwise, one might suspect that PLS should 
perform better than PCA in prediction. We give examples of this in the Results section. 

1.1.2. Classification: LD and QDA 

After dimension reduction by PLS and PCA, the high dunension of p is reduced 
to a lower dimension of if components. Once the K components are constructed we 
considered prediction of the response classes. Since the reduced dimension is now low 
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(K < iV), conventional classification methods such as logistic discrimination and quadratic 
disctiniinant analysis can be applied. 

Let X be the colimm vector ofp predictor values and denotes the binary response 
value. For instance, y = 0 for a normal sample, = 1 for a tumor sample and x is the 
5 corresponding expression values of p genes. In logistic regression, the conditional class 
probability, n = P{y = 1 1 x) = P (tumor given gene profile x), is modeled using the 
logistic functional form, 

(6) 

0 The predicted response probabilities are obtained by replacing the parameter ^ with its 

maximum likelihood estimate (MLE) p . The predicted class of each sample (as a 

normal or tumor sample) is V — -^C^ > 1 - '^r)^ where/(-) is the indicator function; 
1(4) = 1 if condition A is true and zero otherwise. That is, we classify a sample as a tumor 
(y = 1) if the estimated probability of observing a tumor sample given the gene expression 

15 profile, X, is greater than the probability of observing a normal sample with tiie same gene 
profile. This classification procedure is called logistic discrimination ("LD")- As 
inentioned earlier, LD is not defined if iV< p. Thus, in order to utilize the LD procedure, 
we need to replace the original gene profile, x, by the corresponding gene component 
profile in the reduced space, obtained firom PLS or PCA. 

20 Another usual classification method is quadratic discriminant analysis ("QDA") 

based on the classical multivariate normal model for each class: 

x|t;'=.fc'^Np(Sjb,iL4jt), xeW and*'^^^^*-"- For binary classification, G= 1. The 
(optimal) classification regions are 

Rk = € : Pfc/ft(x) > ViMx%4 

25 (7) 

where^ is the probability distribution fonction ("pdf' ) of x 1 = A: given above andp^ = 
P(y = it). This is equivalent to classifying a given sample with gene expression profile x 

into the class with max fe,(xV = 0.1 G},, where 3,(x)= x'A,x+c'x+c, with 

Ai = -0.5SrS.ci = Srift-,^ and ^ = ^SP* ~ O-SlogSi - 0.5/4SrVi. The posterior 
30 probability of membership in class fcis 'Tfc » P(v f= fcW = expfe(x)]/E^«pfe(x)l, 
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Again, the fiill gene projBle, x, is replaced by the corresponding gene component profile 
in the reduced space obtained from PLS or PCA. 

Further details on QDA and other classical classification methods can be found in 
Mardia, Kent, and Bibby (1970), Johnson and Wichem (1992) and Hury (1997). Details 
S on logistic regression can be found in Hosmer and Lemeshow (1989) and McCuUagh and 
Nelder(1989). 

1.1.3. Gene Selection 

Althou^ the two-step procedure outlined above, namely dimension reduction via 
PLS followed by classification via LD or QDA, can handle a large number (thousands) of 
10 genes, only a subset of genes are of interest in practice. Even after gene selection, often, 
the number of genes retained is still larger than the number of available samples. Thus, 
dimension reduction is still needed. It is obvious that good prediction relies on good 
predictors, hence a mediod to select the genes for prediction is necessary. For two-class 
prediction, selection and ranking of the genes can be based on simple t-statistics 

15 ^^""^^ 



(8) 

where ®fe ^ is the size, mean and variance, respectively, of class A; ^ = 1, 2, 
For each gene, a t value is computed. We retain a set of the top j?* genes, by taking 
genes with the largest positive t values (corresponding to high expression for class 1) and 

20 p*tl genes with smallest negative t values (£.e., ttiose negative t values finrthest from 
zero)(corresponding to high expression for class 2). 

We cairied out selections to obtain p* = 50 genes for the ovarian, lexikemia, 
lymphoma, colon, andNCI60 data. The selection revealed that for the leukemia data set 
the gene profiles or patterns show a clear dififerential expression relative to AML/ALL. 

25 This is suggestive of the well-known separability of AML/ALL leukemia classes based 

on gene expression in this data set However, this differentially expressed pattern was not 
as clear for normal and ovarian tumor tissue samples or normal and colon tissue samples. 

1.1.4. Assessing Prediction Methods and Results 

Following gene selection and dimension reduction, we predicted the response 
30 classes. The observed error rate can be used to give a rougjh assessment of a method 

relative to another. The strength or "confidence" associated with any specific prediction 
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(z.£., for each sample) can be assessed by examining &e estimated conditional class 
probability tc described above. 

It is also important to get an idea of how the proposed method will perform in 
light of new data. However, new data are usually not available, so re-randomizatian 
5 study is an alternative. For re-randomization studies a relatively large sample size, iV, is 
needed. If there are sufficient samples, we carry out a three step procedure to assess the 
prediction methods. First, we randomly form a training data set consisting ofN\ of the N 
samples. These iVi samples in the training data set are used to fit the model. The 
rCTiaining Nz—N^Nx samples are saved for model validation (testing). That is, the fitted 
1 0 model is tested on the N2 samples not used to fit the model. This is referred to as out-of- 
sample prediction. 

In the second step, a model is fit to flie training data, and Ihe fit to the training 
data is assessed by leave-out-one cross-validation ("CV")- That is, one of the Ni samples 
is left out and a model is fitted based. on all but the left out sample. The fitted model is 

1 5 then used to predict tiie left out sample. Leave-out-one CV is used for each of the Ny 

samples in the training data set. This provides some protection against overfitting, but it 
is still possible accidentally to select a model that fits the training data especially well due 
to capitalizing on chance. 

The third step assesses the stability of the overall results fiom steps 1 and 2 by re- 

20 randomizations. In the re-randomization step, N[ of the total N samples are randomized 
into a training data set (with the remaining N2 samples forming the test data set) and then 
the out-of-sample and leave-out-one CV prediction is repeated on this permuted data set.. 
Averages of prediction rates over repeated re-randomizations can be used to assess the 
stability of prediction results. 

25 We carried out the re-randomization procedure for the leukemia and lymphoma 

data sets which contain enough samples. For the ovarian andNCI60 data sets, which 
contain few samples, we performed only leave-out-one cross-validation ("CV") 
prediction. 

2.i. Multi^class dassiflcaUon 

30 Suppose that a qualitative response variable y takes on a finite number of 

(unordered) values, say 0, 1, . . . , G often referred to as classes (or groups). That is, y 
indicates the cancer type or class 0, 1, . . , or G, for instance. The problem of mxilti-class 
cancer classification is to predict the class membership or cancer class based on a vector 
of gene expression values ^ ~ • • • 1 ^p)'. Most classification methods, such as 
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classical discriminatioii analysis or polychotomous disciiimnatioii are based on the 
requirement that there are more observations (N) than there are explanatory variables or 
genes (p). One strategy to approach the problem of classification when N<pisto reduce 
the dimension of the gene space from p to say JC, where K « N. This is done by 
constructing K gene components and then classifying the cancers based on the 
constructed K gene components. 

The dimension reduction process is illustrated in Figure 1 using the NC160 data 
set consisting of cell lines derived from cancers of various origins. For illustration, we 
have reduced a gene expression matrix, X of size JV xp ~ 35 x 167, to three gene 
components, tj, ts,, is, using multivariate PLS. It can be seen from the 3-dimensional plot 
(Figure 1, bottom) tiiat the three MPLS gene components separate the five cancer classes 
well (leukemia=*, colon=o, melanoma=+, renal=x, and CNS=0). Classification methods 
such as QDA and PD can be used to predict the cancer classes using the K MPLS gene 
components, here ti, tj, and ts. 

2.1,1. Multivariate Dimension ReducHon: Miiltivariate PLS 

When there is more than one response variable, sayj//, . . . > y/, the objective 
criterion for maximization (under orthogonality constraints) in multivariate PLS is: 

cov^(Xw,Yc) 

(9) 

where w and c are unit vectors. Since coV (Xw,Yc) - var(Xw)c6rr^CXw, Yc)var(Yc) one 
can see tiiat using only the correlation term will lead to the well known canonical 
correlation analysis C*CCA"). The MPLS components are denoted by tit and are linear 
combinations of the gene expression values (X) with coefficients given by Wjt (satisfying 
tiie maxicnization criterion (9)). The PLS algorithm to obtain w (and c) is simple and fesL 
(The algorithm can be found in Hoskuldsson (1988), Garthwaite (1994), Helland (1988) 
or in the context of gene expression data, in the Appendix.) 

The response matrix Y in (9) consists of I continuous (or at least ordinal level) 
response variables, which is the setting MPLS was designed jB3r. However, in the 
exemplified context, we have a qualitative response variable y consisting of classes 0,1, . . 
. , G, namely, cancer type 0 through cancer type G, We need to convert or recede the 
response information indicating cancer class, namely y, into a response matrix Y. To do 
this with the G + 1 cancer classes we created G "design variables" representation (or 
"reference cell coding") of y. That is, we define the iV^x G response matrix Y with 
elements jfik ^Ifyi^k) fori= 1, . . , JVandA:= 1, . . , G. We have used 7(4) to denote the 
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indicator function fer event so tiiat 1(A) = 1 if ^ is true and it is 0 otherwise. (Other 
strategies for constructing Y are possible.) For example, if G = 4 (5 cancer classes), then 
the vector y consists of values 0, 1, . . , , 4 indicating distinct cancer classes 0 through 4. 



constructed from the vector of cancer class indicator y. 

2.1.2. Multi-Class ClassificaOon Methods 

In this section we describe two classification methods that can be applied to make 
class prediction after dimension reduction. Poiychotomous discrimination CTD") is a 
generalization of logistic discrimination when there are more than two classes. QDA 
works for two or more classes so no generalization is needed, but we briefly review the 
method for completeness of exposition. We also describe in this section a preliminary 
Tanking and selection of the large number of genes used for the analyses. 

2.1.3. Poiychotomous Discrimination 

Assume fliat the qualitative response variable y can take on fiirite values, 
y = fe, fc e {0,1,. . . ,G}= O. The distribution of J/ depends on- predictors 
— - J por example, the Mi cancer type (y = fc) depends on the p gene expression 
levels ■ • ' ^-^Fin a given experiment. The response variable is a G-valued random 
variable and assume 

that-?r(ifelx:) = Pisf = few >OforaUx€;tC 72^^ and fc 6 C7,. For convenience 
we define the notation 



The response matrix Y corresponding to response vector y is displayed for G = 4 below 



/ON /OOOOX 

• 1 10 0 0 



y=. 2 -^Y= 0 10 0 
. 3 0 0 1 0 

\4/ Vo 0 0 1/ 



Thus, i^T < <iV multivariate PLS gene components, ti,..., t^^, are extracted 
according to (9) using the origLual gene expression matrix, X, and the response matrix, Y, 





(10) 



This is the log of the ratio of ttie probability of a sample with gene expression 
profile X bemg of cancer type ^relative to cancer type 0. Often fliis quantity (g'jt(x))is 
modeled as a linear function of tiie p gene expressions, x. 
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= log (^1^^ = ^fcO + + 0Jcii^2 + " ■ 4- i^fcp% = x'iSfc. 

(11) 

Since ELi'^O^W = 1. we have (1-^(0^)^0^) = Eg=t«P(s>(=t))andit 
foUowsthat'^(0i=^)^tl'*-S2=i®^^'^(''^r\ For 

fe= 1,,., ff.(feg=«xp(flfc(x))(H^^ from exponentiating (10) and 

using the identity ^iM^) = ltr(fcW/^(OlxjV(aix) Notingthat 9o{^). "= Oforall 
X 6 72?+"^ ( ^0 = 0 ) we can summarized tiie conditional class probabilities as 



10 (12) 



This is the probability that a sample wife gene esqpression profile x is of cancer 
class k. We take (12) as the polyohotomous regression model and note that 
irik\jt) s ir(fe|x; ^> ^ function of v = GCp+i; 

15 parameters^* • • > >^ir) (/? e 71<'<'^*)),^th ' * (Logistic 

regression is when K — I so 3; is binary (0 or 1)). 

An estimate of jff is obtained by maximum likelihood estimation ("MLE") and it is 

described in the Appendix. The MLE offi is denoted j3 and it can be obtained (if it exists) 
when there are more samples than there are parameters, i.e. when N>v=' G(p+1). For 
20 example, utilizing dimension reduction and predicting 5 cancer clajsses (G — 4) using 
3 gene components requires N > G(K + 1^ = 4(4) == 16 samples. Thus, after dimension 
reduction we can use PD by replacing the full gene profile x by the corresponding gene 
component profile in the reduced space obtained by MPLS or PCA. 

From the estimated coefBcient vector (i the estimated conditional class 

25 probabilities tt (A | x) ^ = 0, . . . , can be obtained by substituting p into (12). A 

given sample with gene expression profile x is then predicted to be of cancer class k with 
TTiaviTmiTn estimated conditional class probabilityn (fc 1 x) . That is, we classify (or 
predict) a sample as a cancer of class k if the estimated probability of observing a cancer 
of this class given the gene expression profile, x, is higiher than the probability of 

30 observing any other class of cancer given the same gene expression profile. 
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Further discussions of polychotomous regression can be found, for instance, in 
Hosmer and Lemeshow (1989, Chap. 8), Kooperbei^ Bose, and Stone (1997) and Albert 
and Anderson (1984). 

2.1.4. Quadratic Discriminani Analysis 

Another classification method that can be used after dimension reduction is 
quadratic discrimmant analysis C^QDA''). QDA is based on the classical multivariate 

nomial model for each class: = .G.fpor 
binary classification, G=IJ The (optimal) classification regions are 

(13) 

where/A is the probability density function C'pdf ') of x 1 7 = ^ given above 
and j?^ = P{y = A:) . This is equivalent to classi:fying a given sample with gene e^qpression 
profile X into the class with max {9;(x),z = 0,1,...,G}, where g,.(x)=x'AiX+c'x+c, 
with ^ = -0.5S7\ Ci = Vi; and ^ = ^^^S ^ 0-51ogSt - 0.5|4Sr^>«i, The posterior 
probability ofmembership in class A z^^fe = ^(V = *W = expfe0^)]/aocxpfe(^)]- 
As in PD, tiie full gene profile, x, is replaced by the corresponding gene component 
profile in the reduced space obtained from MPLS or PCA. 

2.1.5. Multivariate Gene Selection 

As in binary classification using univariate PLS described above, the multivariate 
two-step procedure can handle the number of genes (p) as large as the estimafed number 
of genes in the human genome. However, for any given classification problem it may be 
advantageous to select the genes which are "good" predictors of the cancer classes. In the 
binary case, preliminary selection and ranking of the genes based on t-scores works well. 
For more than two classes, we ranked and selected the genes for miilti-class prediction as 
follows. Recall that the cancer classes are labeled by {Q^^i - • • , C?> ^ O a?i, . . . rcjy 
are the expression values of a gene across the N samples (arrays). We compared all 

^^i^^ pairwise (absolute) mean differences, l^te "^^l (&r fc^^ ft^ Kl^ € ^ 
critical score 
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(14) 

where MSe (mean squared error) is the estimate of variability jfrom the analysis of 
variance (ANOVA) model with one fector and G + 1 (cancers) groups and t is the 

*a/2,JV-tG?+i) value of the ^-distribution. Each gene (/= 1, ...,/?) is ranked according to 
flie number of times the pairwise absolute mean difference exceeded the critical score. 
Note that this is the least sigmfilcant difference method in multiple comparison. 

For example, the NCI60 data set described in Figure 1 earlier consists of a subset 
of the data with five cancer classes: leukemia, colon, melanoma, renal, and CNS labeled 

respectively as 0, 1, . . , 4 = G. Thus, there are = 10 pairwise absolute mean 
differences to compare and each gene is ranked as 0,1, . . . , or 10. A zero indicates no 
pairwise mean difference among the 5 cancer classes and a ten indicates pairwise mean 
differences among all 10 possible combinations of cancer classes. The /? = 167 genes 
chosen for analysis in Figure 1 are those having ranked as having 7 or more pairwise 
mean differences according to (14). 

3.0 RESULTS 

We demonstrate the usefulness of the binary (z.e., univariate) classification 
methodologies described above to five microarray data sets with various hiunan tumor 
samples: (1) ovarian (Furey et al, 2000), (2) leukemia (Golub et al^ 1999), C3) 
lymphoma (Alizadeh et aL, 2000), (4) colon (Alon et aL, 1999) and (5) Cancer cell lines 
from the NCI60 data set (Ross et al, 2000). Data sets (2),(3) and (5) M^e published data 
sets and are publicly available at http:/ywaldo.wi.nrit.edu/MPR/; 

http;//llmpp.nih.govAymphoma/; and http://genome-www.stanford.edu/ respectively. The 
ovarian data set is yet to be publish but analyzed results were published by Furey et al 
(2000). 

The application of the multi-class (f.e., multivariate) classification methodologies 
is demonstrated by application of the methods to each of fom: gene expression data sets 
consisting of human cancer san^les: (1) hereditary breast cancer, (2) NCI60 cell lines 
derived from cancers of various origins, (3) lymphon:ia, and (4) acute leukemia. These 
data sets are publicly available at http://www.nhgri.nih.gov/DIR/Microarray; 
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http://www.dtp.nci.nilLgov/docs/d^>-da1a±tml; http://llrapp.ii3h.gov/lyniphoina/; 
http://waldo.wi.imt.edu/MPR/. 

J.J. Binary Classification 

3.1.1. Example 1 — Ovarian Data 

5 The microaixay experiments consist of hybridizations of noimal and ovarian 

tissues on arrays with probes for 97,802 cDNA clones. The ovarian data set considmd 
here consists of 16 normal tissue samples and 14 ovarian tumor tissue samples. The 
normal tissue samples consist of a spectrum of normal tissues: 2 cyst, 4 peripheral blood 
l3miphocytes, 9 ovary and 1 liver normal tissue. All normal and tumor samples are 

10 distinct, coming from difTerent tissues (patients). We log transformed all die gene 

expression values due to the highly skewed data, typical of gene expression data. The 
expression of all genes also were standardized to have mean zero and standard deviation 
of one across samples. 

We considered /?* - 50,100, 500, 1000, 1500 genes selected as described in the 

IS Methods section. Since there are few samples, we made leave-out-one CV prediction. 

Classification of the 30 tissue samples based onK^3 gene components constructed firom 
p* genes iising PLS and PC A are given in Table 1. Overall, the classification results are 
good. AH normal and ovarian tumor sanoples were correctly classified using LD with 
PLS and PCs. Results for QDA is the same, except, with PCs one normal (cyst) sample 

20 was misclassified (p* = 50 and 500). Also, different analyses using p* = 1000 and 1500 
misclassified one normal ovarian sample. However, all classification methods using PLS 
gene components are 100% coirect for the ovarian data. Furey et aL (2000) also used 
leave-out-one CV prediction for this data set as well, but using support vector machiaes 
("SVM") (Vapnik, 1998). Although it is not our intent to tune our analyses to theirs to 

25 make eTcact comparisons, a crude observation can be made. Furey et aL reported 3-5 

normal samples and 2-4 ovarian tumor samples misclassified using SVM based on 25 to 
100 genes. (See Table 1 of Furey et aL Furey et al, included another sample tissue fix>m 
the same patient We only use samples firom distinct patients since samples should be 
independent However, inclusion ofthis one extra sample did not change the results 

30 reported here.) 

Table 1: Classification results for normal and ovarian tumor sainples. Given are the 
number of correct classification out of 30 samples (16 normal and 14 ovarian tumor 
samples). 
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p* 




LD 




QDA 


Sample 




PLS 


PC 


PLS 


PC 


Misclassified 


50 


30 


30 


30 


29 


#1 


100 


30 


30 


30 


30 




500 


30 


30 


30 


29 


#1 


1000 


30 


30 


30 


29 


#4 


1500 


30 


30 


30 


29 


#4 



The strength or "confidence" in the predictions made can be assessed by 
examining the estimated conditional class probability, namely n = P{Y '~ | xt),Jt = 0,1 , 
5 where x] is the gene profile (pattern) in the reduced iST-dimensional space. For/?* = 50 
and 100 genes, the estimated conditional probability is essentially one for PLS and the 
lowest rf is 0.973 fiat PCA. This holds for = 1000 and 1500 genes as well. However, 
for/7* = 500 genes, two samples were correctly classified (PCA) with moderate estimated 
conditional class probability of 0.922 and 0.925. Sample 16 is a normal sample from a 
10 white blood cell line <HWBC3) and exhibits characteristics of both normal and tumor 
cells, which makes it a likely candidate for nusclassification. SVM had problems 
classifying this sample (Furey et al, p. 910), but PLS correctly classified this sample as 
normal tissue. 

3.1,2, Example 2 — Leukemia Data 

15 The data set used here is the acute leukemia data set published by Golub et aL 

(1999). The original training data set consisted of 38 bone marrow samples with 27 acute 
lymphoblastic leukemia (ALL) and 1 1 acute myeloid leukemia (AML) (from adult 
patients). The independent (test) data set consisted of 24 bone marrow samples as well as 
10 peripheral blood specimens from adxilts and children (20 ALL and 14 AML). Four 

20 AML samples from the independent data set were from adult patients. The gene 
expression intensities were obtained from Afifymetrix high-density oligonucleotide 
microarrays containing probes for 6,817 genes. We log transformed the gene expressions 
to have a mean of zero and standard deviation of one across samples. No fiirther data 
preprocessing was applied. 

25 We first applied the methods of the invention to the original data structure of 38 

training samples and 34 test samples for^* = 50, 100, 500, 1000, and 1500 genes selected 
as described earlier. The results are given in Table 2. All methods predicted the 
ALU AML class correctly 100% for the 38 training samples using leave-out-one CV. 
Prediction of the test samples using LD based on the training (PLS and PCA) components 

30 resulted in one misclassification: sample # 66. This is based on - 50 genes. This 
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AML sample was also misclassified by Golub et aL (1999) using a weighted voting 
scheme. Participants of the Critical Assessment of Techniques for Microairay Data 
Mining (CAMDA'OO, December 2000) Conference analyzing the leukemia data set all 
misclassified sample #66. Whether the sample was incoirectly labeled is not known. 

Table 2: Classification results for the leukemia data set with 38 training samples (27 
ALL, 1 1 AML) and 34 test samples (20 ALL, 14 AML). Given are ttie number of correct 
classification out of 38 and 34 for the training and test samples respectively. 







Training Data 








Test Data 






(Leave-out-one CV) 






(Out-of-sample) 






LD 


QDA 




LD 


QDA 


p* 


PLS 


PC 


PLS 


PC 


PLS 


PC 


PLS PC 


50 


38 


38 


38 


38 


33 


33 


28 30 


100 


38 


38 


38 


38 


32 


■ 32 


29 30 • 


500 


38 


38 


38 


38 


31 


31 


32 28 


1000 


38 


38 


38 


38 


31 


31 


31 28 


1500 


38 


38 


38 


38 


31 


30 


30 28 



To assess the stability of the results shown in Table 2 we carried out a re- 
randomization study as described in the Methods section. We considered an equal random 
splitting of the A^= 72 samples: N\ = 36 training and N^ = 36 test samples. The analysis 
above was repeated for 100 re-randomizations. Table 3 gives the average classification 
rates over the 100 re-randomizations. LD and QDA prediction based on PLS gene 
components resulted in near perfect classification (between 99% and 100% correct) for 
the training samples using leave-out-one CV, PCs fered slightly worse (between 90% and 
97% correct). This is based on all considered. For the test samples, PLS gene 
components in LD performed better than PCs. However, both PLS and PCs performed 
similarly in QDA, 



Table 3: Classification results for re-randomization study of the leukemia data set witii 
36/36 splitting. Each value in the table is the correct classification percentage averaged 
over 100 re-randomizations. Perfect classification is 36. 



PLS 



Training Data 
(Leave-out-one CV) 
LD QDA 



LD 



Test Data 
(but-of-sanjplc) 



PC 



PLS 



PC 



PC 



PLS 



QDA 
PC 



50 


36.00 


34.08 


35.99 


3432 


34.72 


33.66 


34.63 


34.63 


100 


35.88 


33.29 


35.89 


34.89 


34.30 


32.92 


34.80 


34.58 


500 


36.00 


34.32 


36.00 


35.09 


34.73 


34.08 


34.53 


34.60 
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1000 


36.00 


32.95 


36.00 


34.57 


34.82 


32.50 34.77 


34.09 


1500 


36.00 


32.51 


36.00 


33.79 


34.71 


32.11 34.67 • 


33.66 



We also classified the samples based on the 50 "predictive" genes set reported by 
Golub et al. Leave-out-one CV predictions of the 38 training samples using QDA and LD 
with PLS gene components resulted in 100% correct and 36/38 for PCs. Based on only 

5 the training components, out-of-sample predictions of the 34 test samples were also made. 
LD with PLS gene components resulted in one misclassification (#66). Golub ei al 
associated with, each prediction a "prediction strength" ("PS"). (For details, see Golub et 
al) Five test samples were predicted with low (PS < .30) to borderline prediction 
strength: samples # 54, 57, 60, 67, and 71 (PS=<I.23, 0.22, 0.06, 0.15, and 0.30) with one 

10 sample misclassified. These five samples were all correctly classified using LD with PLS 
gene components with moderate to high conditional class probabilities of 0.97, 1.00, 0,98, 
0.89 and 1.00 respectively. Results for all 72 samples are given in Table 4 and re- 
randomization results, given in Table 5, showed the stability of the estimates. 



15 Table 4: 50 Genes firom Golub et al Predicted (1=ALL, 0=AML) probabilities using 
leave-out-one CV for original 38 training samples and out-of-sample prediction for the 34 
test samples using PLS and PC. PiS is the prediction strength from Golub a/. ForLD, 
71 is an estimate of %=P(J^ Ijdata), and for QDA it is the posterior probability or 
conditional class probability. Samples marked with an * were misclassified. 

20 



Training Data Test Data 

LD QDA LD^ ^ QDA^ 

i yi PS .Ttpz, rfpe n TC „c ^ ^ oh 1^ PC ' Pis PC 



1 


1 


1-00 


■ pa 

1.00 


1.00 


1.00 


PC 

1.00 


39 


1 


0.78 


1.00 


1.00 


... 

1.00 


1.00 


2 


1 


0.41 


1.00 


1.00 


1.00 


1.00 


40 


1 


0.68 


1.00 


1.00 


1.00 


1.00 


3 


1 


0.87 


1.00 


1.00 


1.00 


1.00 


41 


1 


0.99 


1.00 


1.00 


1,00 


1.00 


4 


1 


0.91 


1.00 


1.00 


1.00 


1.00 


42 


1 


0.42 


1.00 


1.00 


1.00 


1.00 


5 


1 


0.89 


1.00 


1.00 


1.00 


1.00 


43 


1 


0.66 


1.00 


1.00 


1.00 


1.00 


6 


1 


0.76 


1.00 


1.00 


1.00 


1.00 


44 


1 


0.97 


1.00 


1.00 


1.00 


1.00 


7 


1 


0.78 


1.00 


1.00 


1.00 


1.00 


45 


1 


0.88 


1.00 


1.00 


1.00 


1.00 


8 


1 


0.77 


1.00 


1.00 


1.00 


1.00 


46 


1 


0.84 


1.00 


1.00 


1.00 


1.00 


9 


1 


0.89 


1.00 


1. 00 


1.00 


1.00 


47 


1 


0.81 


1.00 


1.00 


1.00 


1.00 


10 


1 


0.56 


1.00 


1.00 


1.00 


1.00 


48 


1 


0.94 


1.00 


1.00 


1.00 


1.00 


11 


1 


0.74 


1.00 


1.00 


1.00 


1.00 


49 


1 


0.84 


1.00 


1.00 


i.do 


1.00 


12 


1 


0.20*+ 


1.00 


0.02* 


1.00 


0.00* 


50 


0 


0.97 


0.00 


0.00 


0.00 


0.00 


13 


1 


1.00 


1.00 


1.00 


1.00 


1.00 


51 


0 


1.00 


0.00 


0.00 


0.00 


0.00 


14 


1 


0.73 


1.00 


1.00 


1.00 


1.00 


52 


0 


0.61 


0.00 


0.01 


0.00 


0.00 


15 


1 


0.98 


1.00 


1.00 


1.00 


1. 00 


53 


0 


0.89 


0.00 


0.00 


0.00 


0.00 


16 


1 


0.95 


1.00 


1.00 


1.00 


1.00 


54 


0 


0.23+ 


0.03 


1.00* 


0.00 


0.15 


17 


1 


0.49 


1.00 


1.00 


1.00 


1.00 


55 


1 


0.73 


1.00 


1.00 


1.00 


1.00 


18 


1 


0.59 


1.00 


1.00 


1.00 


1.00 


56 


1 


0.84 


1.00 


1.00 


1.00 


1.00 
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Training Data 
LD QDA 



Test Data 
LD QDA 



z 


yi 


DC" 


A 


^ DC 


A 

^ pis 


7t PC 


19 


1 


0.80 


1.00 


1.00 


1.00 


1 f\t\ 

1.00 


20 


1 


0.90 


1.00 


1.00 


1.00 


1.00 


21 


1 


0.76 


1.00 


1.00 


1.00 


i f\r\ 

1.00 


22 


1 


037 


1.00 


1.00 


1.00 


t f\f\ 

1.00 


23 


1 


0.77 


1.00 


1.00 


1.00 


1.00 


24 


1 


0.82 


1.00 


1.00 


1.00 


1.00 


25 


1 


0.43 


1.00 


1.00 


1.00 


1.00 


26 


1 


0.89 


1.00 


1.00 


1.00 


1.00 


27 


1 


0.82 


1.00 


1.00 


1.00 


1.00 


28 


0 


0.44 


0.00 


0.00 


0.00 


0.00 


29 


0 


0.74 


0.00 


0.22 


0.00 


0.00 


30 


0 


0.80 


0.00 


0.00 


0.00 


0.00 


31 


0 


0.61 


0.00 


0.00 


0.00 


0.00 


32 


0 


0.47 


0.00 


0.00 


0.00 


0.00 


33 


0 


0.89 


0.00 


0.00 


0.00 


0.00 


34 


0 


0.64 


0.00 


0.00 


0.00 


0.00 


35 


0 


0.21+ 


0.00 


1.00* 


0.00 


1.00* 


36 


0 


0.94 


0.00 


0.00 


0.00 


0.00 


37 


0 


0.95 


0.00 


0.00 


0.00 


0.00 


38 


0 


0.73 


0.00 


0.00 


0.00 


0.00 



Yr PS 



pit 



It 



TZ 



#OOTtect I 38 36 38 36 



57 0 

58 0 

59 1 

60 0 

61 0 

62 0 

63 0 

64 0 

65 0 
66 
67 
68 



70 
71 



0 
1 
1 

69 1 



72 1 



0.22+ 

0.74 

0.68 

0.06+ 

0.40 

0.58 

0.69 

0.52 

0.60 

0.27*+ 

0.15*+ 

0.80 

0.85 

0.73 

0.30+ 

0.77 - 



0.00 

0.08 

1.00 

0.02 

0.35 

0.00 

0.00 

0.00 

0.00 

0.93* 

0.89 

1.00 

1.00 

1.00 

1.00 

1.00 



1.00* 

0.00 

1.00 

1.00* 

1.00* 

0.63* 

0.98* 

0.27 

0.21 

1.00* 

1.00 

1.00 

1.00 

1.00 

1.00 

1.00 



0.00 

1.00* 

1.00 

0.00 

1.00* 

0.00 

0.00 

0.00 

0.00 

1.00* 

1. 00 

1.00 

1.00 

1.00 

1.00 

1.00 



0.03 

0.01 

1.00 

0.68* 

0.02 

0.00 

0.00 

0.01 

0.00 

0.99* 

020* 

1.00 

1.00 

1.00 

1.00 

1.00 



33 



27 



31 



31 



Table 5: Results from re-randomizations using the 50 genes obtained by Gohib e 
Given are average classification rate fi»m all re-randomizations (36 training('36 
5 samples splitting). 





PLS 


LD 
PC 


PLS 


QDA 
PC 


Training Data 
Test Data 


99.56 
95.94 


96.44 
94.17 


99.56 
96.44 


97.00 
95.44 



3.1.3. Example 3 — Lymphoma Data 

The lymphoma data set was published by AHzadeh et al. (2000) and consists of 
10 gene expressions from cDNA experiments involving three prevalent adult lymphoid 
malignancies: diffuse large B-cell lymphoma CT>LBCL"), B-cell chronic lymphocytic 
leukemia ("BCLL") and follicular lymphoma ("FL"). Each cDNA target was prepared 
fixjm an experimental mRNA sample and was labeled with Cy5 (red fluorescent dye). A 
reference cDNA sample was prepared from a combination of nine different lymphoma 
15 cell lines and was labeled witii Cy3 (green fluorescent dye). Each Cy5 labeled target wa 
combined with the Cy3 labeled reference target and hybridized onto the microarray. 
Separate measurements were taken from the red and green channels. We analyzed the 
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Standardized log relative intensity ratios, namely the log(Cy5/Cy3) values. To test the 
binary classification procedures of the present invention, we analyzed a subset of the data 
consisting of 45 DLBCL and 29 BCLL samples with p = 4^27 genes. 

Using leave-out-one CV, each sample was predicted to be DLBCL or BCLL 
5 based on 3 gene components constmcted from = 50,100, 500 and 1000 genes. The 
results are given in Table 6. Of the 74 total samples, PLS gene components resulted in 
either one or two misclassificatiohs at most. The two misclassified samples, # 33 and 51, 
were consistently misclassified. PCs did not perfomied as well using LD, with at most 
four misclassifications. However, PCs used with QDA performed similar to PLS 
10 components. 

Table 6: Classification results for DLBCL and BCLL lymphoma samples. Given are the 
number of correct classification out of 74 samples (45 DLBCL and 29 BCLL samples). 
Samples misclassified are given in parenthesis on the right side of the table. 

15 



p* 


LD 




QDA 






Saniple(s) Misclassified 






PLS 


PC 


PLS 


PC 


PLS 


PC 


PLS 


PC 


50 


72 


73 


73 


72 


(33,51) 


(51) 


(51) 


(33,51) 


100 


72 


71 


72 


73 


(33.51) 


(9,33,51) 


(33,51) 


(51) 


500 


72 


71 


73 


"73 


(33, 51) 


(9,45,51) 


(51) 


(45) 


1000 


72 


70 


73 


73 


(33, 51) 


(9, 32,48, 


(51) 


(51) 














51) 







As with the analysis of the leukemia data, we turned next to re-randomization 
studies to assess the stability of the classification results. Table 7 summarizes the results 
20 of 100 re-randomizations (with 37/37 random split). For this data set, PLS components in 
LD appear to perform best for leave-out-one CV ( of the training data sets). Out-of- 
sample prediction results for PLS and PCs are similar. On average, classification of the 
training samples using leave-out-one CV is nearly 100% correct and about less than two 
misclassifications out of 37 for test samples. 

25 

Table 7: Classification results for re-randomization study of the lymphoma data set with 
37/37 splitting. Each value in the table is the correct classification percentage averaged 
over 100 re-randomizations. Perfect classification is 37. 

Training Data Test Data 

(Leave-out-one-CV) (Out-of-sample) 
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LD 




QDA 




LD 




r\T\ A 




p* 


PLS 


PC 


PLS 


PC 


PLS 


PC 


PLS 


PC 


50 


36.87 


36.26 


36.64 


35.60 


35.57 


36.03 


35.88 


35.81 


100 


36.86 


36.38 


36.74 


36.30 


35.84 


36.29 


36.03 


36.10 


500 


36.89 


35.15 


36.77 


35.99 


35.76 


35.21 


35.69 


35.85 


1000 


36.83 


34.94 


36.90 


35.68 


35.58 


33.87 


35.32 


35.12 



3.1.4. Example 4 — Colon Data 

Alon et al. (1999) used Aflfymetrix oligonucleotide arrays to monitor expressions 
5 of over 6500 human genes with samples of 40 tumor and 22 normal colon tissues. Using 
a clustering algorithm based on the deterministic-annealing algorithm, Alon et al, 
clustered the 62 samples into two clusters. One cluster consisted of 35 tumor and 3 
normal samples (n8, nl2, n34 [the labeling for the 22 normal tissues in Alon et al are not 
in consecutive order]). The second cluster contained 19 normal and 5 tumor tissues (T2, 

10 T30, T33, T36, T37). (See Figure 4 of Alon et al) Furey et al. (2000) did leave-out-one 
CV prediction of the 62 samples using SVM and six tissues were misclassified, namely 
(T30, T33, T36) and (n8, n34, n36). As Furey et al pointed out, the three misclassij5ed 
tumors (T30, T33, T36) were among the five tumor samples which clustered into the 
normal group by Alon et al Also, two normal samples (n8, n34) misclassified by Furey 

15 et aL were among the three normal samples clustered into the tumor group by Alon et aL 
Classificatioh of tumor and normal colon tissues using the methods of the 
invention is displayed in Table 8. We carried out analyses for p*^ = 50,100, 500 and 1000. 
PLS gene components in LD for 50 and 100 genes performed best with four 
misclassifications. For = 50 genes T2, Til, T33, and n36 were misclassified. Foip* = 

20 100 genes Til, T30, T33, and nl 1 were misclassified Note that with the exception of 
Til and nl 1 the samples misclassified here also were misclassified using SVM and by 
clustering. Note fi:om Table 8 that the PCs analysis is not competitive relative to PLS 
components for this data set We also note that gene expression patterns for this data set 
is quite heterogeneous. Further, tiie samples that are most commonly misclassified by 

25 various methods of analysis have expression patterns that are quite different firom their 
respective groups. 

Table 8: Classification results for normal and colon tissue samples. Given are the number 

of correct classification out of 62 samples (40 tumors and 22 normal samples). 

30 ^ 

LD QDA 
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PLS PC__ PLS PC 

50 58 54 . 57 . 54 
100 58 53 56 52 
500 56 53 57 53 
1000 57 52 56 54^ 



3.1.5. Example 5 --NCISO Data 

The NC 1 60 Data set, pubKshed by Ross et al (2000), consists of samples from . 
5 human tumor cell lines. The data is from 60 cDNA airays each contaimng 9,703 spotted 
cDNA sequences. The cDNAs arrays contain approximately 8,000 miique genes in 60 
human cell lines obtained from various tumor sites: 7 breast, 5 central nervous system 
("QMS"), 7 colon, 6 leukemia, 8 melanoma, 9 non-small-cell-lung-carcinoma 
CTSISCLC'O, 6 ovarian, 2 prostate, 9 renal, and 1 unknown. The reference sample used in 

1 0 all hybridizations was prepared by combining an equal mixture of mRNA from 12 of the 
cell lines. As with the lymphoma (cDNA data) we analyzed the standardized log relative 
intensity ratios, namely the log(Cy5/Cy3) values. To illustrate our binary classification 
procedures to this cell lines gene expression data, we selected two of the largest groups: 
9 NSCLC and 9 renal samples. Using a subset of 6,814 genes we applied dimension 

15 reduction methods to the selected = 50,100,500 and 1000 genes. The results are given 
in Table 9. PLS gene components predicted all NSCLC and renal cell lines samples 
correctly 100% in all instances. For each analysis, PCs misclassified only one sample, 
either sample 4 or 15. The expression patterns of these two misclassified samples are 
quite different from their respective groups. 

20 

Table 9: Classification results for NSCLC and renal cell lines. Given are the number of 
correct classification out of 18 samples (9 NSCLC and 9 renal samples). 



p* 


LD 

PLS 


PC 


PLS 


QDA 

PC 


■ Sample 
Misclassified 


50 


18 


17 


18 


18 


#4 


100 


18 


17 


18 


18 


#4 


500 


18 


18 


18 


17 


. #15 


1000 


18 


18 


18 


17 


#15 



25 
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3.L6. Example 6- A Condition Wlten PCs Fail to Predict Well 

The conditions under which PLS predicts well have not yet been fblly 
characterized in the statistics or chemometrics literature. la this exanaple we illustrate a 
condition when PCs fiiil to predict well, but PLS components continue to predict well. 
5 The example given is based on the leukemia data set of Golub et ah (1 999). 

In the analyses given above, although the results for PLS components were better 
than that for PCs, the results for PCs were competitive nonetiieless. Examining the 
objective critmon of PLS and PCA, we noted earlier fliat it would be reasonable to expect 
predictions based on PCs to be similar to that from PLS if the predictors (e.g., genes) are 

10 highly predictive of response {e.g,^ leukemia) classes. This is the case of the analyses 
based on the 50 predictive genes reported by Golub et al^ for instance. However, to see 
when PCs feil to predict well, while PLS components succeeded, we considered their 
prediction ability based only on expressed genes, but not exclusively expressed 
differentially for leukemia classes. This test condition is based on the simple feet that an 

1 5 e?^ressed gene does not necessarily qualify as a good predictor of leukemia classes. For 
instance, consider a gene highly expressed across all samples, ALL and AML. In this 
case, the gene will not discriminate between ALL and AML welL We dejSne iBve nested 
data sets consisting of all genes expressed on (A) at least one array (p = 1,554), (B) 25% 
' (p = 1, 076), (C) 50% (p = 864), (D) 75% (p = 662) and (E) 100% (p = 246) of the arrays. 

20 Note that these genes are expressed but not necessarily differentially expressed for 

ALL/AML. As before, we applied PLS and PCA to extract three gene components from 
these five data sets based on the 38 trainiag samples. Predictions of the 38 training 
samples were based on leave-out-one CV and predictions of the 34 test samples were 
based on the training components only. 

25 The results are given in Table 10. As can be seen, the decrease in performance of 

PCs relative to PLS is drastic compared to the result of the 50 predictive genes (Tables 4 
and 5). To check the stability of flie results in Table 10, we ran 50 re-randonuzations. 
The results are given in Table 1 1 . PCs did much worse relative to PLS gene components 
in the re-randomizations as well. 

30 

Table 10: Logistic discrimination and quadratic discrimiaant analysis of original (38 
training/34 test samples splitting) based on class prediction using leave-out-one CV for 
training data set and out-of-sample prediction for test data set. The five data sets consist 
of expressed genes, but not all are differentially expressed for AML/ALL. 

35 
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% correct, train 






% correct, test 


Gene Set 


PLS 


PC 


PLS 


PC 


LD 


Set A 


100.00 


84.21 


91.18 


73.53 


SetB 


100.00 


81.58 


91.18 


73.53 


SetC 


100.00 


84.21 


91.18 


73.53 


SetD 


100.00 


81.58 


91.18 


73.53 


SetE 


100.00 


76.32 


79.41 


64.71 


QDA 


Set A 


100.00 


84.21 


91.18 


82.35 


SetB 


100.00 


84.21 


94.12 


82.35 


SetC 


100.00 


81.58 


91.18 


82.35 


SetD 


100.00 


81.58 


91.18 


88.24 


SetE 


100.00 


57.89 


71.05 


50.00 



Table 11: Average classification rates from 50 re-randomizations (36 traiiimg/36 test 
samples splitting) and prediction using leave-out-one CV for training data sets and out-of- 
sample prediction for test data sets. 





% correct, train 






% correct, test 


Gene Set 


PLS 


PC 


PLS 


PC 


LD 


Set A 


99.67 


86.67 


93.78 


82.00 


SetB 


99.94 


87.44 


94.39 


83.00 


SetC 


99.83 


89.94 


93.89 


83.83 


SetD 


99.78 


85.00 


94.78 


83.39 


SetE 


97.44 


73.89 


89.22 


69.00 


ODA 


Set A 


100.00 


83.44 


93.17 


82.39 


SetB 


99.94 


86.33 


94.28 


85.00 


SetC 


99.72 


88.28 


93.50 


85.17 


SetD 


99.78 


85.00 


94.78 


83.39 


SetE 


98.06 


67.28 


89.61 


64.89 



The result here is not surprising since PCA aims to sunnnarize only the variation 
of the p gmes. However, only a subset of p expressed genes is predictive of leukemia 
classes. Why then do PLS components still perform well in this mixture of expressed 
genes, both predictive and-non-predictive of leukemia classes? This is most likely 
attributed to the choice of objective criterion used, namely covariation between the 
leukemia classes and (the linear combination of) the p genes. Since PLS conqjonents are 
obtained from maximizing cov(Xw, y) it is more able to assign patterns of weights to the 
genes which are predictive of leukemia classes. 
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Further indication of this condition, where PCs fail to predict leukemia classes 
while PLS components succeeded, can be found in tiie table of response (leukemia 
classes) and predictor (genes) variation accounted for by the extracted gene components. 
For example, Table 12(a) smnmarizes variation explained by the constructed PLS 
components and PCs for gene set A (p = 1, 554), Note that three (K== 3) PLS 
components explained 93.8% of response variation and about 58.7% of predictor 
variability, compared to three PCs explaining 55.3% and 60.4% respectively. Thus, the 
total gene variability accoimted by PCs and PLS components are similar, but PCs were 
unable to account for much of the leukemia class variation. Note also that the first PC 
accounted for 44.5% of total predictor (gene) variability but it accounted for only 2.4% of 
total response (leukemia class) variability. This is an indicator that it will poorly predict 
the leukemia classes, as it indeed did (Tables 10 and 11). Now consider the same analysis 
but with the 50 informative genes. This is given in Table 12(b). This time, the first PC • 
accounted for about 46.3% of predictor variability but also accoimted for 84.9% of 
response (leukemia class) variation-tins is a notable increase from 2.4% to 84.9%. 

Table 12: Variability Explained by PLS components and PCs. The nimiber of 
components extracted is K, 

K Predictor Response 

Proportion Cumulative Proportion Cumulative 



(a) Gene set A. 


















PLS 






1 


26.4713 


26.4713 


50.0156 


50.0156 




2 


27.1942 


53.6655 


26.0319 


76.0475 




3 


5.0562 


58.7217 


17.7467 
PC 


93.7942 




1 


44.4644 


44.4644 


2.3520 


2.3520 




2 


10.5679 


55.0323 


38.2658 


40.6177 




3 


5.3219 


60.3542 


14.6836 


55.3014 


(b) 50 Predictive 












genes. 








PLS 






1 


46.2635 


46.2635 


86.1931 


86.1931 




2 


14.7372 


61.0006 


3.4223 


89.6154 




3 


7.2307 


68.2314 


4.4394 

PC 


94.0548 




1 


46.3143 


46.3143 


84.9414 


84.9414 




2 


19.3407 


65.6549 


0.7407 


85.6821. 




3 


5.3636 


71.0185 


0.1557 


85.8377 
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4. BINARY CLASSIFICATION CONCLUSIONS AND DISCUSSIONS 

We have introduced statistical analysis methods for the classification of tumors 
based on microarray gene expression data. The methodologies involve dimension 

5 reduction of the high /^-dimensional gene expression space followed by logistic 

classification or quadratic discriminant analysis. We have also illustrated the methods' 
effectiveness in predicting normal and tumor samples as well as between two different 
tumor types. The samples varied fiom human tissue samples to cell lines generated from 
both one and two channels microarray systems; such as oligonucleotide and cDNA 

10 arrays. The methods are able to distinguish between normal and tumor samples as well as 
between two types of tumors from five different noicroairay data sets with high accuracy. 
Furthermore, these results hold under re-randomization studies. Finally, we have also • 
illustrated a condition under which PLS components are superior to PCs in prediction. 
The problem of distinguishing normal from tumor samples is an important one. 

1 5 Another problem of interest is ia characterizing multiple types of tumors. A data set 

illustrating this multiple classification problem is the NCI 60 data set, which contains nine 
types of tumors. The problem of multiple classification based on gene expression data is 
much more difficult than the problem of binary classification illustrated the preceding 
examples. The method of multivariate PLS (Hoskuldsson, 1988; Garthwaite, 1994) is 

20 usefiil for this problem as illustrated in the following section. 

The PLS method can be of use for gene expression analysis in other contexts as 
well. Predicting the expressions of a target gene based on the remaining mass of genes is 
one example. Here, PLS is used to reduce the dimension of the predictors and then 
multiple linear regression (or another prediction method for continuous response) is used 

25 to predict the expressions of the target gene. Quantifying the predicted gene expression 
values such that they axe compatible with some clinical outcomes is of practical value. 

Another related problem which is amenable to analysis using the methods of the 
invention include assessing the relationship between cellular reaction to drug therapy and 
their gene expression pattern. For example, Scherf e/ al, (2000) assessed growth 

30 inhibition from tracking changes in total cellular protein (in cell lines) after drug 
treatment. Here, the response of cell lines to each drug treatment are the response 
variables, y. Associated with the cell lines are their gene expressions, p. Since the 
expression patterns are from those of untreated cell lines, Scherf 5/ ah focused on the 
relationship between gene expresision patterns of the cell lines and their sensitivity to drug 
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tiierapy. This relationship can he studied via a direct application of the univariate or 
multivariate PLS methods of the invention, which can handle the high dimensionality of 
the data. 

Another example, in cancer research, is the prediction of patient survival times 
based on gene expressions. For example, Ross et al (2000) compared patient survival 
duration with germinal center B-like DLBCL compared to those with activated B4ike 
DLBCL using Kaplan-Meier survival curves (Kaplan and Meier, 1958). These groups 
were determined by gene expression analysis. A more general and useful approach is to 
model the observed survival (and censored) times, j;, as a function of the/? gene 
expressions. A common tool widely used for this purpose is the proportional hazard 
regression proposed by Cox (1972). Again, straight-forward application of this method is 
not possible since JSr< p. Hence, dimension reduction is needed, however, care is needed 
to address the observed censored times. Our work indicates that the PLS methods of tha 
invention are of use in this context as well. 

4.L Multi-Class Classification 

4.1.L Example 7 Hereditary Breast Cancer Data 

Hedenfalk and co-workers (2001) studied gene expression patterns in hereditary 
breast cancer. In particular, many cases of hereditary breast cancer are attributed to 
individuals with a mutant BRCAl or BRCA2 gene. Breast cancers with BRCAl or 
BRCA2 mutation have pathologically distinct featiires (e.g., high mitotic index, 
noninfiltrating smootii edges and lymphocytic indSltrate, grade level; see Hede nfa lk et aL, 
p.539-540). Furfliermore, distinctive features of BRCAl and BRCAl cancers are used to 
distinguish them from sporadic cases of breast cancers. Previous experimental evidence 
indicates that generally cancers with BRCAl mutation lacks both estrogen and 
progesterone receptors but these hormones receptors are present in those with BRCA2 
mutations (Karp et a/., 1997; Johannsson et al, 1997; Loman et al, 1998; Verhoog et al,, 
1998). Also, functional BRCAl and BRCA2 proteins are involved in the repairing of 
damaged DNA, hence, cells with the mutant genes have decreased ability to participate in 
DNA repair. 

Hedenfelk et al (2001) monitored the global expression patterns of 7 cancers with 
BRCAl mutation, 8 vn!&iBRCA2 mutation, and 7 sporadic cases of primary breast 
cancers iising cDNA microairays. {See Table 1, p.543 of Hedenfelk et al for a simrmary 
of the characteristics of all 22 samples.) There were 6, 512 cDNA used which represent 
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5,361 unique genes. Among the 5,361 genes 2,905 are known and 2,456 are unknown 
genes. Selected for analysis were ^ = 3,226 genes and these are available publicly. 

The varied phenotypes and paftways to cancer formation induced by BRCAl and 



used to predict 5i?C4/-mutation-positive versus jffiJCli-mutation-negative. This would 
be done by pooling the 8 samples with BRCA2 mutation with the 7 sporadic cases of 
breast cancer into one group. Similarly, the 7 samples with BRCAl mutation can be 
pooled with the 7 sporadic samples into one group to make class prediction for BRCA2- 
mutation-positive versus negative. However, such "one-versus-all** classification is not 
completely satisfactory since distinct diEferences between all three classes (BRCAl- 
mutation, BRCA2-mutation, and sporadic) is expected at the measured mKNA level. • 
Thus, we considered multi-class cancer classification methods to predict each sample as a 
breast cancer with BRCAl mutation, BRCA2 mutation or as sporadic breast cancer based 
on the observed gene cTqjression profiles of the samples belonging to the three cancer 
classes. 

Preliminary ranking and selection of the genes for analysis was carried out as 
described earlier in the multivariate gene selection section. Thenmnberof genes with 0, 
1, 2, or 3 pairwise absolute mean differences exceeding the critical score is 2269, 541, 
405, or 1 1 respectively. Thus, of the 3,226 genes 2,269 showed no pairwise absolute 
mean difference and only 1 1 genes showed all 3 pairwise differences. Note that 405 
genes showed 2 pairwise differences, however, this does not mean that they can not 
discriminate amongst the three cancer classes. Taken together, these genes present a 
global expression pattern that can be used to discriminate among the three cancer classes. 
The subset of genes selected for analysis is denoted by J3*. We considered two analyses 
based on = 1 1 (genes with all 3 pairwise differences) and p* = 416 (genes with at least 
2 pairwise differences). 

We applied multivariate PLS and PCA to reduce the dimension from = 1 1 or 
j7* = 416 toK-3 MPLS gene components and 3 PCs respectively. All analyses were 
based on standardized log expression ratios. Prediction of each of the iV = 22 samples as 
BRCAl, BRCA2, or as sporadic was carried out using PD and QDA based on the 
constructed gene components. Prediction results were based on leave-out-one cross- 
validation (CV). 



BRCA2 mutation suggest that the gene expression patterns of breast cancer samples 
between these mutations or lack thereof may be distinct In the framework of 
classification or class prediction, one can ask whether the gene expression patterns can be 
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The results are summarized in Table 13. PD using MPLS gene components 
coirectly classified all 22 samples using eitJier /?* =^ 11 orp* = 416 genes. For;?* = 416, 
MPLS components in QDA and PCs in PD also correctly predicted all samples into their 
cancer classes. For this data set MPLS gene components performed better than PCs in 
bofli PD and QDA. 

Table 13: Hereditary breast cancer data. = 22, m = 7 {BRCA1\ = 8 (BRCA2), and 
= 7 (sporadic). Given are the number of misclassification out ofN= 22 samples and in 
parenthesis are the samples misclassified with superscript 1, 2 and s indicating BRCAl, 
BRCA2 and sporadic respectively. 





PD 

MPLS PCA 


MPLS 


QDA 1 
PCA 


# Pairwise Absolute 
Mean Difference 


11 

416 


0 ICtflS") 
0 0 


2(#2', 15^) 
0 


3(ffi', \3\ 21") 
2(#16*. 20*) 


3 

>2 



An interesting sporadic sample misclassified by PCs using QDA (p* = 416) is 
sample 20. When classifying all samples as either 5jRG4i-mutation-positive versus 
negative (binary classification) Hedenfelk et al. misclassified this sporadic sample as 
having ^ BRCAl mutation. We obtained similar results using the binary classification 
methods of the invention described above. Studies have suggested that abnormal 
methylation of the promoter region is indicative of inactivation oi^^ BRCAl gene 
(Catteau et al^ 1999; Esteller et al, 2000); therefore, such samples show similar 
phenotypes as samples with BRCAl mutation. Thus, such samples are potential 
candidates for misclassification when using data at the molecular level. However, 
expression patterns (or lack thereof) of an inactivated gene is not identical to that of a 
mutated gene. It is reasonable that if one looks at a large class of genes simultaneous 
subtle expression patterns emerges which can be predictive of cancer classes. 

4.L2. Example 8 - NC160 Data: CeU Lines Derived from Various Cancer 

Sites 

The NCI60 data set was introduced earlier {see Figure 1) to illustrate the process 
of dimension reduction. This data set is from Ross et al (2000) and Scherf et al 
(2000). The data is ftom 60 cDNA arrays each containing 9,703 spotted cDNA 
sequences. The cDNAs arrays contain approximately 8,000 unique genes in 60 human 
cell lines obtained from various cancer sites. The reference sample used in all 
hybridizations was prepared by combining an equal mixture of mKNA fiom 12 of the cell 
lines. For illustration of the application of the multi-class classification methods to cancer 
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classijBcation, we considered classification of 6 cancer types: leukemia (ni = 6), colon 
(n2 = 7), melanoma ("wj = 8), renal (714 ^ 8), and CNS (Irij = ^j. 

We analyzed the standardized log relative intensity ratios, namely the 
log(Cy5/Cy3) values. Specifically, we used a subset of 1,376 genes and 40 individually 
5 assessed targets (p = 1, 416) analyzed by Scherf et ah (2000) relative to drug activities of 
the same cell lines, which is publicly available. For this data set there are some missing 
gene expression values. Genes with 2 or fewer missing values (out of 35) were included 
for analysis by replacmg the (1 or 2) missing values with the median of the gene's 
expression. This resulted in a subset of 1,299 genes which we used for analysis. 

1 0 Applying the preliminary gene ranking procedure resulted in the following 

ranking of the genes: 167 (0), 76 (1), 115 (2), 119 (3), 266 (4), 148 (5), 241 (6), 109 (7), 
53 (8), 5 (9), 0 (10). That is, 167 genes showed no pairwise absolute mean difference, 76 
genes showed 1 pahrwise differrace, etc. We pooled aU genes showing at least 8 pairwise 
differences (p* = 58) and also all genes showing at least 7 pairwise differences (/?* = 167) 

1 5 for analysis. As before dimension reduction via MPLS and PCA and classification using 
PD and QDA were then used to predict the cancer class of each sample. 

The classification results based on leave-out-one CV are displayed in Table 14. 
Withp* = 58 genes 3 MPLS gene components and PCs correctly classified all cancer 
classes using PD. Tlffee MPLS gene components constructed from = 167 genes also 

20 correctly classified all cancer classes with PD, These components are plotted in Figure 1, 
which illustrates dimension reduction for NCI60 data. In Figure 1 the NCI60 data, the 
"original" gene e^qxression data set used here is X35 ^ \gi and K—2> PLS gene components 
are constructed giving T35 x 3 = [ti, ta, tz\. The 3-dimensional PLS gene components plot, 
illustrates the separability of the cancer classes: leukemia=*, colon==o, melanoma=+, 

25 renal=x, and CNS=0. As shown in Table 14, QDA did not perform as well as PD, with 
one misclassification when using MPLS gene components (both = 58 and 167). This 
Gorranonly misclassified sample (#14), a melanoma sample, is marked in Figure 1 
(bottom) and it can be seen that the sample does not group with the other melanoma 
samples. 



30 



35 



Table 14: NCI60 data:- 5 cancer classes. N^3S,n\^6 (leukemia), 722 = 7 (colon), 713 = 
8 (melanoma), = 8 (renal) and 715 = 6 (CNS). Given are the number of misclassification 
out of iV= 35 samples and in parenthesis are the samples misclassified with superscript le, 
CO, me, re, and cn mdicating leukemia, colon, melanoma, renal, and CNS respectively. 
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P 



MPLS 



PD 



PCA 




QDA 



# Pairwise Abs. 
Mean Difference 



58 
167 



0 
0 



3 (#29", 31*", 
34") 



0 



1(#14"0 3(#1", 14"*, 30") . 
ICM'^ 5(#14~ 26".29",31'", 
34*^ 



>8 



>7 



( * 

10 



15 



; 20 



25 



41 J.5. Example 9 — Lymphoma Data 

The lymphoma data set was published by Alizadeh et al (2000) and consists of 
gene expressions from cDNA experiments involving thiee prevalent adult lymphoid 
maUgnancies: difftise large B-cell lymphoma CT)LBCL"), B-cell chronic lymphocytic 
leukemia CTBCLL^ and folUcular lymphoma CTL"). Each cDNA target was prepared 
from an experimental mKNA sample and was labeled with CyS (red fluorescent dye). A 
reference cDNA sample was prepared from a combination of nine different lymphoma 
cell lines and was labeled with CyS (green fluorescent dye). Each CyS labeled target was 
combined with the Cy3 labeled reference target and hybridized onto the microarray. 
Separate measurements were taken from the red and green channels. We analyzed the 
standardized log relative iatensity ratios, namely the log(Cy5/Cy3) values. 

The lymphoma data set consists ofN^ 83 samples of three cancer classes: 45 are 
DLBCL, 29 are BCLL and 9 are FL. Previously we tested binary classification using 
analogous dimension reduction and classification melhods on this data set using only the 
two largest groups (DLBCL and BCIX) (Nguyen and Rocke, 2001). Now, we consider 
multi-class cancer classification of all 3 classes simultaneously. We analyze a subset of 
the data consisting ofp = 4,15 1 genes. Preliminary ranking and selection of the p genes 
were performed as described in the multivariate gene selection section. The procedure 
resulted in 2,168 genes with 0 pairwise absolute mean difference, 1,003 with 1, 896 with 
2, and 84 with all 3 pairwise absolute mean expression difference. 

Using leave-out-one CV, each sample was predicted to be DLBCL, BCLL, or FL 
based on 3 gene componmts constmcted fcom = 84 genes (with aU 3 pairwise mean 
differences) and = 980 genes (with at least 2 pairwise mean differences). The results 
are given in Table 15. For PD MPLS gene components performed better than PCs with 
two misclassifications (97.6%), However, for this data set QDA performed best with 
only one misclassification (98.8%). A BCLL sample (#51) was misclassified by all (eight 
combinations) of the methods. MPLS gene components performed better than PCs for;?* 
= 84 and the results are equal for p* = 980. 
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Table 15: Lymphoma data: N - 83, m = 45 (DLBCL), m = 29 (BCLL), = 9 (FL). 
Given are the number of misclassiflcation out of J\r= 83 samples and in parenthesis are 
flie samples misclassified with superscript D, B and F indicating DI3CL, BCLL and FL 
respectively. 



p* 


PD 

MPLS PCA 


QDA 

MPLS 


PCA 


# Pairwise Abs. 
Mean Difference 


84+ 
980 


2 5 
4 4 


3 
1 


6 
1 


>2 
3 


MPLS-PD 

PCA-PD 

MPLS-QDA 

PCA-QDA 


p* = ZA 
(#9^,51*) 

(#9^,11^, 18^,51-^,55^ 

(#5^11^,51^ 

(#9^. 1 1^, 18^. 51* 55*, 75^) 


MPLS-PD 
PCA-PD 
MPLS-QDA 
PCA-QDA 


p* = 980 

(#9^, 32^, 48^, 51") 
(#9^, 32^, 48*. 51*) 
(#51*) 
(#51*) 



+ Model valhout intercept 



4.1.4. Example 10^ Acute Leukemia Data 

The data set used here is the acute leukemia data set published by Golub et aL 

10 (1999). The original training data set consisted of 38 bone marrow samples with 27 acute 
lymphoblastic leukemia ("ALL") and 1 1 acute myeloid leukemia ("AML") (from adult 
patients). The independent (test) data set consisted of 24 bone marrow samples as well as 
10 peripheral blood specimens from adults and children (20 ALL and 14 AML). It has 
been noted that global expression patterns of T-cell ALL ("T-ALL") and B-cell ALL ^B- 

15 ALL") are distinct and can be used to differentiate between the two sub-classes of ALL 
(Golub et aL, 1999). Thus, for multi-class cancer discrimination we pooled the two data 
sets to obtain N = 72 samples with three cancer classes: (1) AML (ni = 25), (2) B-ALL 
(>J2 = 38) and (3) T-ALL (W3 = P). 

The gene expression intensities were obtained from Asymetrix high-density 

20 oligonucleotide microarrays containing probes for 6,817 genes. We log transformed the 
gene expressions to have a mean of zero and standard deviation of one across samples. 
For the subsequent analyses we used a subset of p = 3,490 genes. As in the analyses of 
the previous data sets we first ranked the genes. The procedure resulted in 1,945 genes 
with 0 pairwise absolute mean difference, 732 with 1,719 with 2, and 84 with all 3 

25 pairwise absolute mean expression differences. 

As before, using leave-out-one CV, each sample was predicted to be AML, B- 
ALL, or T-ALL based on 3 gene components constmcted from = 94 genes (with all 3 
pairwise mean differences) and /?* == 813 genes (with at least 2 pairwise mean 
differences). The results are given in Table 16. Classification methods compared 
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similarly as for the lymphoma data set. Best classification results come fi^om QDA with 
MPLS components constructed fiom = 813 genes (all correct) and from = 94 genes 
(1 incorrect). In all eight analyses combined there were 4 samples which were 
misclassified: two B-ALL (# 12, 17), one AML (#66), and one T-ALL (#67). 

Table 16: Acute leukemia data: N « 72, m = 25 (AML), = 38 (B-ALL), and = 9 
(T-ALL). Given are the number of misclassification out of N— 72 samples and in 
parenttiesis are the samples misclassified with superscript^, B and T indicating AML, B- 
ALL and T-ALL respectively. 



p* 




PD 




QDA 


# Paiiwise Absolute 




MPLS 


PCA 


MPLS 


PCA 


MeanDifBsence 


94 


4 


4 


1 


3 


>2 


813 


3 


4 


0 


2 


3 



j7* = 94 j7* = 813 

MPLS-PD (#12^ 17^, 66^, 675 MPLS-PD (#17'', 66^, 67") 

PCA-PD (#12^ 17^. 66^ 670 PCA-PD (#12^, 17^ 66^ 67') 

MPLS-QDA (#12^ MPLS-QDA (none) 

PCA-QDA (#12^ 66-^, 67^) PCA-QDA (#12^ 67^) 



4.LS. Example J J — Simulation Studies 

We have tested the proposed methodologies for mult-class cancer classification 
on four gene expression data sets. To fiulher study the performance of the proposed 
methodologies we designed a simulation model and procedure for simulating gene 
expression data. The proposed methodologies are applied the simulated data to assess the 
relative performance. The simulation model presented here is for multi-class but a similar 
simulation was carried out for the binary classification (Nguyen and kocke, 2000). More 
details can be found there. 

Simulation Model and Procedure 

It is sensible in dimension reduction techniques (such as PCA) to use the total 
variability to describe a given data set Certain physical mechanisms, such as DNA 
microarray technology, seem to generate data with a few underlying fiictors or 
components that explain a large amount of the total variability. The simidated data 
matrices are generated to mimic this physical process. For instance, if it is assumed that 
the data have only a few underlying components then the data matrix X generated 
should reflect this observation. For flexibility in comparing the performances of 
various statistical methods, however, the data matrix X is generated so that the first few 
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PCs account for a specified proportion of total variability. We then generated data with 
a spectrum of total variability ranging from 30% to 90%, which encompasses total gene 
variability of nearly all real gene expression data that we have observed. 

We now describe the method of generating the data. The ith row of the x p data 
matrix is generated as follows. Generate 

= TiiT\ + • • • + r^ira + €i. i = 1,.. - . jJV 

(15) 

or ^ii = EA=im^i + e^,- (i = \ " >Pi where 
Tfs = (7]bi, - . . ^-Hsp)' (A: = 1^ • . - , cO, ei =: 

(Cii, . . . yCipY .g ^ vector of i.i.d. noises, and ^^'^^ " ** * ^ is a set of constants. 
We used the models 

<Sy- ^,i•AI^(l), of). 

(16) 

Elements of "Qie zth row of X is obtained as 

Xij = exp(a;|^) j = 1, • . . ,p- 

(17) 

This model was proposed in Nguyen and Rocke (2000) and a study of the choices 

of /*T^ and ratio of standard deviation ^^/^''"as well as details of simulation 
parameters were discussed there. 

After the generation of the data matrix X the true probabilities are generated 
according the to polychotomous regression model» 

fCi = C7r(Ql:jC€),^(l|Xi),.,. M^M) where 
(18) 

The true coefBcient vector are assume fixed, but are actually generated from a 
N fO, or^ ) (jjgtribution and the simulation parameter ^ . are chosen in conjunction with 
t*e*f^ and ratio of standard deviation '^t/^r. (The effect of these simulation parameters 
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are discussed and given in Nguyen and Rocke (2000).) The response variable 7 is 

generated according to the vector ^ of true probabilities. That is, for classes 0,1, . . . , G 
generate the G x 1 multinomial vector, 

%i ^ MlLtt(l, TTf), i = 1, . , - yN. 



Note that is a vector with a single entry of one and the rest are zeros. If (Jc 
= 0,1; . . . , is one in the Jfctti entry then the observed response is yi = h Thus, for iV 
10 random samples • • - > ^Jv) obtain the response vector 

We generated 100 data sets each of size NxptoiiN'== 60 and various^ = 100, 
300, 500, 800, 1000, 1200, 1400 and 1 600 (total of 800 iST datasets). Four sets of 
simulation (total 3,200) datasets are generated so that three PCs explain about 30%, 50%, 

15 70% and 90% of total predictor variabihty: = (Ai + A2 + Wp). (The actual 

datasets generated achieved the percentages of 27%; 47%, 70% and 90%.) The generated 
response variable contains four groups (G = 3). For each data set, three multivariate PLS 
components and PCs were extracted and classification was performed using (1) PD (2) 
QDA with direct resubstitution and (3) QDA with leave-out-one cross-validation 

20 OLachenbruch and Mickey; 1968). The results are summarized iu the next section, 

4.1.7. Simulation Results 

We first compare classification using multivariate PLS components and PCs using 
. (1) PD, (2) QDA with direct resubstitution, and (3) QDA with leave-out-one cross- 
validation. Although the nominal percraitage level of correct classification is lower in the 

25 multiple class setting than the binary case the general results are similar. Classification 
using multivariate PLS components out performs classification using PCs. As the 
percentage of total predictor variability accounted for by tiie extracted PCs increases 
.(aveCA, 3) 27%,47%,70%,90%) classification based on PCs improves. Despite the 
improvement MPLS out performs PCs; m feet the performance of MPLS appears not to 

30 be influenced by the percentage of total predictor variability accounted for by the 

extracted PCs. The results are summarized in Figure 2. Figure 2 illustrates PD using 
MPLS ( 0 ) components & PCs (-♦-)• Percentage of correct classification usmg PD with 



5 

(19) 
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MPLS components and PCs, Each row (of plots) correspond to percentage correct 
classification > 80%, >70%, >60%,>50%. Each column is the percentage of total 
predictor variation accounted for by three PCs, ave (X^3). The x-axis is the number of 
variables p (1 p = 100, 2 -> p = 300, 3 -j^ p = 500, 4 -> p 800, 5 p = 1,000, 
5 = 1,200, 7 p = 1,400, 8 p = 1,600). The y-axis gives the number out of 100 datasets 
generated with percentage of correct classification > 80%, > 70%, > 60%, > 50%. Thus, 
MPLS components appear to perform better than PCs using PD under Ihe simulation 
model. This is also true with the QD A mettiod using direct resubstitution (Figure 4) as 
well as QDA using leave-out-one cross-validation (Figure 3). Figure 3 illustrates QDA 

1 0 with leave-out-one CV using MPLS components and PCs. Each row (of plots) 

correspond to percentage correct classification > 80%, >70%, >60%,>50%. Each 
column is the percentage of total predictor variation accoimted for by three PCs, ave (^3). 
Thex-axis is the number of variables p (1 —►p = 100,2-* p' = 300,3 ->p = 500,4-»'p 
= 800, 5 -H- p = 1,000, 6 = 1,200, 7 p = 1,400, 8 p = 1,600). The y-axis gives 

15 the number out of 100 datasets generated with percentage of correct classification > 80%, 
> 70%, > 60%, > 50%. Figure 4 illustrates QDA with direct-resubstitution using MPLS 
components and PCs. Each row (of plots) correspond to percentage correct classification 
>80%, >70%, >60%,>50%. Each column is the percentage of total predictor 
variation accounted for by three PCs, ave (X,3)- The x-axis is the number of variables p 

20 (1 -> p = 1 00, 2 p = 300, 3 p = 500, 4 p = 800, 5 p = 1 ,000, 6 p = 1,200, 7 
— > p = 1,400, 8 — ^ p = 1,600). The y-axis gives the munber out of 100 datasets generated 
with percentage of correct classification > 80%, > 70%, > 60%, > 50%. As expected, 
QDA with direct resubstitution did better than QDA using cross-validation. For a given 
real dataset, direct resubstitution in QDA gives inflated level of correct classification and 

25 a better indicator is to use cross-validation OLachenbruch and Mickey, 1968). 

For direct comparison of tiie performance of MPLS components under the 3 
different classification methods, we re-plotted only the MPLS components using (1) PD, 
(2) QDA-direct resubstitution, and (3) QDA leave-out-one CV in Figure 5. Figure 5 
illustrates MPLS components in PD (-o-), QDA-direct-resubstitution (-*-), and QDA-CV (- 

30 +-). The Figure compares the percentage of correct classification using PD and QDA 
(direct-resubstitution, and leave-out-one CV) with MPLS components. Each row (of 
plots) correspond to percentage correct classification > 80%, > 70%, > 60%, > 50%. 
Each colxrmn is the percentage of total predictor variation accounted for by three PCs, ave 
(X,3). The X-axis is the number of variables p (1 p = 1 00, 2 ^ p = 300, 3 p = 500, 4 

35 ^ p = 800, 5 p = 1,000, 6 p = 1,200, 7 p = 1,400, 8 p = 1,600). The y-axis 



-39- 



wo 02/25405 




PCTAJSOl/29731 



gives the number out of 100 datasets generated with percentage of correct classification > 
80%, > 70%, > 60%, > 50%.. PD performed better than QDA generally, but not always. 
In a few instances QDA with MPLS con^onents outperformed PD or at least was well. 
This consistent witii classification results fix)m real data reported here for multi-class as 
well as for binary classification. 

5. MULTI-CLASS CLASSIFICATION CONCLUSIONS AND DISCUSSIONS 

We have described multi-class cancer classification methods that are extension of 
ttie binary classification methods described above. The methodologies utilize dimension 
reduction methods to handle high dimensional data such as the large number of genes in 
microairay data. Gene components constructed via MPLS performed well with PD 
and^or QDA As in the binary case explored earlier, gene components extracted via PCA 
did not perform as weU. This was conJSrmed in the application of the methods using PLS, 
MPLS and PCS to 4 cancer data sets as well as to data generated fix»m the simulation 
model for gene expression data. Although the methods were applied to data sets with 
various cancers, the classification methods proposed here are general and can be applied 
in other classification settings for high dimensional biological data as well, as are 
suggested in the specification. For example, gene expression data from various stages 
of a particular cancer may be used to predict, e,g^, patient survival, drug sensitiyity of 
the tumor, or other clinical outcomes. 

An advantage of the methodologies proposed is that other classification methods 
can be utilized (other than PD and QDA) after dimension reduction via MPLS. As 
discussed in the Appendix, numerical methods are needed to obtain the MLE in PD and 
the existence of the MLE depends on the data configin^tion. One disadvantage of using 
PD is when there is quasi-complete separation in the data. As one of ordinary sldll is 
aware, detection of quasi-complete separation is numerically burdensome and 
classification is usually poor. (See Appendix for details.) Also, inversion problems can 
be encountered in the Newton-Raphson algorithm when searching for the MLE. One of 
ordinary skill will readily determine how to make use of alternate classification methods 
in the event that difficulties are encountered witii PD or QDA. 

6.0 APPENDIX 
6.1. PLS Algorithm 

The following PLS algorithm is given in Hoskuldsson (1988) and adopted in Garthwaite 
(1994). For details, see also Helland (1988) and Martens andNaes (1989). 
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1. FOR Jt=.l to rf set u to first column of and DO: 

2. w = X'u/(u'u) and scale w to be of unit length. 

3. t = Xw. 

4. c = Y't/(t't) and scale c to be of unit length. 

5. u = Ye and GO TO 6 IF convergence ELSE return to 2. 

6. p=X't/(t't). 

7. 6=u't/(t't). 

8. Residual matrices: X(jfe+i) = Xj^^j - tp' and Y^J^^^) = Y(^) - fete' (with 
X,Y(,)=Y). 

9. END FOR 

6,L1. Likelihood Function for Polychotomous Regression 

To obtain the likelihood jEimction for N independent samples 
(thi3ci),,.. I (&j>r. Sat) mxder the polychotomous regression model we first define some 
notation. Let ^(^» ^) == log[l + 2£i exp(ff*(xi))] rewriting (12) we have 
Mfei^e:* ) =^ e3tp[,(i?fc(xil - - c(xr> Thus, 

(20) 

Also, for the zth observed response value ya corresponding to explanatory values 
(artOra:ii,-- -,iR|pT (and a?io ^ 1) lei = (^^^Oi^i.-. ,^a) be the row vector indicating 
whether 7; is in group ft € That is ^ ~ where /(/4) is the indicator fimction 

for A. If Z is the iV x (G + 1) matrix consisting of rows ^'^^ then S&=0 ^k — ^^ 

TOW sums are one). Using the above notations, the likelihood for JV independent samples 

(ignoring constants) is 



(21) 
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Hence, the log-likelihood is 

(22) 

Using (20) together with ^ for each i; the log-likelihood is 

N 

(23) 



10 



6.1.2. MLE for Polychotomous Regression Using Newton-Rtqthson 

Estimation of ^ is obtained by maximum likelihood estimation (MLE). Iterative 

methods such as the Newton-Raphson method can be used to obtain the MLE p . This 

requires first and second order derivatives of^^\ For convenience let ^ 
15 It is straight forward to obtain 

9Pi (24, 

25) 

20 Thus the derivative of with respect to is 



^1 
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since — c(x2:; f3) = logTr^rtr and dlo^-Xio/d^k '^'^ik'^i* The. score vector is 



sm = 



(26) 



The G(> + 1) squared m/omafto« matrix /(P)= v-// / 

• 1(0) 

Older derivatives of and are given below 

dm 



requires second 



(27, 28) 



10 The asymptotic covariance matrix of the MLE of is the inverse of For 

iterative computation of the MLE using the Newton-Raphson method is it more concise 

to express /(/^) as follows. Define the following iV* x TV diagonal matrices. 



15 



20 



and letting ^"^'^^ ^^*^^) ^ ^i^^^ = ^ ^'W^X.,^^ 

information matrix can be express as 



J cjCpj4-J)voch-i) 



(29) 



For an initial value ^^"^ liie MLE of is obtained iteratively through 
^(i+lj _ ^(t) _,_ J-i{^(*))S(^W)_ iftheNewton-Raphson algorithm converges, then 
the vector of coef&cients at convergence is denoted p and it is the MLE of ^. 
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Existence of MLE for Polychotomous Regression Model 

We briefly describe the conditions for existence of the MLE of in the 
polychotomous regression model. The reader is referred to Albert and Anderson (1984) 
for details. Possible data configurations can be categorized into three mutually exclusive 
and exhaustive groups: (1) complete separation, (2) quasicomplete separation; and (3) 
overlap. The first two situations lead to parameter estimates often referred to as "infinite 
parameters." Specifically for (1) there exists a vector 8 which coxrectly classil^ all 
observations to their class, i.e. 



for all ^ ^ ^k^ where (7^ (fc = 0, . , . , <?) is an index set identifying all samples in class 
k. Here, the MLE does not exist and the -21og-likelihood decreases to zero. Empirical 
15 detection of complete separation is to stop iteration when the probability of correct 

classification is 1 for all samples. Nearly all model fits with MPLS components reported 
here are of this type. Quasicomplete separation is when there is a vector P such that 



jfor all ^ ^ and the equality holds for at least one {i, kj) (one sample in each class). 
Again, the MLE does not exist for this data configuration. Empirical detection is based 
on monitoring tiie probability of correct classification approaching one and the dispersion 

25 matrix, which is unbounded. This was encountered often with PCs. For the third case, 
overlap, the MLE exist and is unique. 

The foregoing description is intended to illustrate but not limit the invention, the 
scope of which is defined by the claims. Additional embodiments and variations that do 
not depart firom the invention but rather are within the scope of the invention will be 

30 apparent to those skilled in the art in view of the description provided herein. All 
references cited within the specification, including patents, patent applications, and 
scientific publications are hereby incorporated by reference in their entirety for all 
purposes. 
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CLAIMS 

We claim: 

1 1 . A method of classifying a biological sample, comprising: 

2 calculating iT partial least squares components from an Nx p input data set, 

3 wherein JV is a number of samples in the data set, ^ is a number of predictors observed for 

4 the 7/ samples, wherein K<p, and wherein said input data set has for each of the N 

5 samples an associated response variable, y, that identifies one of G groups to which each 

6 of the N samples belongs, 

7 using said K partial least squares components to calculate a set of classification 

8 equations, and 

9 applying said classification equations to a data set obtained from a biological 

1 0 sample to predict which of said G groups the sample belongs to and thereby classify the 

11 sample. 

1 2. The method of claim 1, further comprising determining an estimated 

2 conditional class i>robability of the prediction. 

1 3 . The method of claim 1 , wherein G = 2, said response variable, y, is binary, 

2 and said binary response variable, is used for calculating said impartial least squares 

3 components. 

1 4. The method of claim 3, wherein said classification equations are 

2 calculated using logistic regression. 

1 5. The method of claim 3, wherein said classification equations are 

2 calculated using quadratic discriminant analysis. 

1 6. The method of claim 3, wherein said classification equations are 

2 calculated using linear discriminant analysis. 

1 7. The method of claim 1, wherein said partial least squares components are 

2 modified using singular value decomposition prior to calculating said set of classification 

3 equations. 

1 8. The method of claim 1, wherein said partial least squares components are 

2 modified using linear combinations of univariate logistic regression prior to calculating 

3 said set of classification equations. 

1 9. The method of claim 2, wherein G is an integer greater than 2, said 

2 method further comprising creating (G-1) indicator variables and using multivariate 

3 partial least squares on the vector response of the (G-1) indicator variables to calculate 

4 said K partial least squares components. 
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1 10. The me&od of claim 9, wherein said classification equations are 

2 calculated using polychotomous logistic regression. 

1 11. The method of claim 9, wherein said classification equations are 

2 calculated using quadratic discriminant analysis. 

1 12. The method of claim 9, wherein the classification equations are calculated 

2 using linear discriminant analysis. 

1 13. The method of claim 1, wherein said input data set and said data set 

2 obtained Scorn a biological sample comprise gene expression measurements. 

1 14. The method of claim 3, wherein said G groups are tumor and normal 

2 groups. 

1 15- The method of claim 9, wherein said G groups include different tumor 

2 types. 

1 16. The method of claim 9, wherein said G groups include predicted survival 

2 times. 

1 17. The method of claim 9, wherein said G groups include different ceUular 

2 reactions to drug therapy. 

1 18. The method of claim 1 , wherein said input data are nonnalized to have a 

2 mean of zero and a standard deviation of one. 

1 19. . The method of claim 1, wherein said input data set and said data set 

2 obtained fiom a biological sample compiise ratios between a reference and a test 

3 measurement. 
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